Advanced B.3 - The Reproducible Workflow, End to End

🎯 What We'll Cover

Everything so far has been apparatus. This page runs it for real: a complete analysis of the messy Berg River data, from the first inspection to a result and a folder that someone else could open and reproduce. Then the two things that keep it honest — verifying work an agent did largely on its own, and disclosing what the AI actually did — where we find that the reproducible folder is itself the best disclosure you can give.

🔁 Git as a Reproducibility Trace

One piece of the structure deserves a word before the worked example, because it is the part most researchers skip. Git — the version-control system — records the state of your project over time. Each commit is a labelled snapshot; the history is a record of how the work actually evolved. You do not need to become a software engineer to benefit: even committing at a few milestones (“raw data inventoried,” “cleaning done,” “analysis complete”) gives you a trace that says what the project looked like at each stage, and lets you see exactly what changed between them.

Claude Code can do the Git work for you — initialise the repository, show you what changed, commit with a message — so the mechanics are not a barrier. What matters is the habit of using the history as a research record, not just a backup: a commit message like “halved upstream counts (double-counting fault, per field note); excluded contaminated D-03” turns the version history into a narrated account of your decisions. Used this way, Git is reproducibility infrastructure, not a developer's indulgence.
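To make the "narrated commit" idea concrete, here is a minimal Python sketch that assembles a milestone commit message from a list of decisions and hands it to Git. The function names `build_commit_message` and `commit_milestone` are hypothetical, and the sketch assumes `git` is installed and the folder is already a repository; in practice you might simply ask the agent to do the same thing.

```python
import subprocess
from pathlib import Path


def build_commit_message(milestone: str, decisions: list[str]) -> str:
    """Combine a milestone label with the decisions taken since the last commit."""
    if not decisions:
        return milestone
    bullets = "\n".join(f"- {d}" for d in decisions)
    return f"{milestone}\n\n{bullets}"


def commit_milestone(repo: Path, milestone: str, decisions: list[str]) -> None:
    """Stage everything in the repository and commit it with a narrated message."""
    message = build_commit_message(milestone, decisions)
    subprocess.run(["git", "add", "--all"], cwd=repo, check=True)
    subprocess.run(["git", "commit", "-m", message], cwd=repo, check=True)


# Build (but do not commit) the message for the cleaning milestone.
message = build_commit_message(
    "cleaning done",
    [
        "halved upstream counts (double-counting fault, per field note)",
        "excluded contaminated D-03",
    ],
)
print(message)
```

The first line stays short so it reads well in a one-line history view, while the bullets carry the reasoning a later reader will want.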

🔬 A Worked Reproducible Analysis

Here is the Berg River study, start to finish, with the discipline of the previous pages applied. The dataset is the same deliberately messy archive from Lesson A.3 (downloadable below), and it has real problems planted in it: inconsistent column names, a systematic counting fault at one site, a contaminated outlier, missing values, and field notes that explain anomalies the spreadsheets don’t. Watch how the structure turns that mess into something inspectable, including one correction that, it turns out, decides the answer.

Put the CLAUDE.md in place. The research-habits template from B.2 goes in the project root before anything else, so every rule — don't touch data/raw/, log decisions, don't guess — is active from the first action.
Inspect and inventory. In plan mode, the agent reads every raw file and writes docs/data-inventory.md: the columns (and the fact that the three site files name them differently), the units, the row counts, the missing values, and the anomalies — including upstream counts that sit suspiciously close to the town’s (surprising for a site above the town), and one downstream value far higher than its neighbours. Crucially, field-notes.md explains both: the upstream counter double-counted all season, and the high downstream sample (D-03) was contaminated and should be discarded.
Pre-register the question. Before running the comparison, the prediction and decision rule from B.2 go into pre-registrations/ and are committed. The standard is set while there is nothing to defend.
Plan the analysis; you review it. Still in plan mode, the agent proposes the cleaning and analysis steps. You read the plan — this is the checkpoint — and approve it before a single file is written.
Clean, and log every decision. The agent writes a cleaning script to scripts/ that harmonises the column names, halves the upstream counts (the field notes record that the counter double-counted all season), excludes the contaminated D-03, and handles the missing values by the pre-registered rule. Each of these choices is appended to notes/decision-log.md, dated, with its reason and its source. The raw files are never touched; the cleaned data lands in data/processed/. Watch the upstream halving in particular: it is about to decide the answer.
Run the analysis to a committed log. The analysis script runs the pre-registered Mann-Whitney test, prints its result to outputs/analysis_run.log (committed), and saves the figure and summary table to outputs/.
Record assumptions and uncertainty. The agent notes what it assumed and what remains uncertain — the small sample after exclusions, the reliance on a single field note for the upstream correction that drives the result — rather than presenting a clean result as if nothing were in doubt.
Commit. A Git commit snapshots the whole state, with a message that narrates the decisions.
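The cleaning step in the list above might look roughly like the following sketch. It is not the worked solution's script: the column aliases, the canonical names (`sample_id`, `site`, `particles_per_l`), the toy rows, and the "drop missing values" rule are all assumptions made for illustration, and the decision log is kept in a list rather than written to notes/decision-log.md.

```python
from datetime import date

# Hypothetical raw-to-canonical column names; the real headers are whatever
# docs/data-inventory.md records, and will differ from these.
COLUMN_ALIASES = {
    "SampleID": "sample_id",
    "sample": "sample_id",
    "Site": "site",
    "location": "site",
    "Particles/L": "particles_per_l",
    "count_per_litre": "particles_per_l",
}


def harmonise(row: dict) -> dict:
    """Rename known column aliases to the canonical names, leaving others as-is."""
    return {COLUMN_ALIASES.get(key, key): value for key, value in row.items()}


def log_decision(log: list, decision: str, reason: str, source: str) -> None:
    """Record one dated decision; a real script would append to notes/decision-log.md."""
    log.append(f"{date.today().isoformat()} | {decision} | {reason} | {source}")


def clean(raw_rows: list, log: list) -> list:
    """Return cleaned copies of the rows; the raw rows themselves are not modified."""
    cleaned = []
    for raw in raw_rows:
        row = harmonise(raw)
        if row["sample_id"] == "D-03":
            continue  # contaminated sample, excluded per the field notes
        if row["particles_per_l"] in ("", None):
            continue  # placeholder rule; the pre-registered rule may differ
        value = float(row["particles_per_l"])
        if row["site"] == "upstream":
            value = value / 2  # the upstream counter double-counted all season
        cleaned.append({**row, "particles_per_l": value})
    log_decision(log, "halved upstream counts", "counter double-counted all season", "field-notes.md")
    log_decision(log, "excluded D-03", "contaminated sample", "field-notes.md")
    log_decision(log, "dropped rows with missing counts", "placeholder for pre-registered rule", "pre-registrations/")
    return cleaned


# Toy rows, invented for illustration only.
raw_rows = [
    {"SampleID": "U-01", "Site": "upstream", "Particles/L": "40"},
    {"sample": "D-03", "location": "downstream", "count_per_litre": "900"},
    {"sample": "D-04", "location": "downstream", "count_per_litre": ""},
    {"SampleID": "T-01", "Site": "town", "Particles/L": "55"},
]
decision_log = []
cleaned_rows = clean(raw_rows, decision_log)
print(cleaned_rows)
print("\n".join(decision_log))
```

The point of the shape, rather than the details, is that every transformation sits next to the log entry that justifies it, and that the function returns new rows instead of editing the raw ones.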

🧪 The test that matters: hand it to a stranger

Now the question reproducibility actually asks. Give the finished folder to someone who was not there. Can they reconstruct what happened? Yes — and not because they trust you, but because the evidence is all present: the untouched raw data, the script that transforms it, the decision log explaining every judgement call (why the upstream counts were halved, why D-03 was dropped), the pre-registration showing the standard was set in advance, the outputs, and the run log proving what was executed. They can re-run the script and get the same result, and they can disagree with a decision because they can see it.

And here is what makes that decision log earn its place. Without the upstream correction — if a hurried researcher never opened the field notes — this same analysis concludes that the town has no measurable effect: town and upstream read almost identically (a median difference of 2 particles/L, not remotely significant). With the correction, the town is clearly and significantly higher (a difference of 36, p ≈ 0.00002). One logged decision, sourced to a single field note, is the entire difference between the right answer and the wrong one. A reader of the folder can see exactly where the result turns; a chat transcript would have buried it.
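To see how a single halving can flip a conclusion, here is a small sketch that runs the same comparison with and without the correction. The numbers are invented toy values, not the Berg River data, so they will not reproduce the figures quoted above. The test is a hand-rolled Mann-Whitney U with a normal approximation and no tie correction, chosen only to keep the example dependency-free; a real analysis would more likely call a library routine such as `scipy.stats.mannwhitneyu`.

```python
import math
from statistics import median


def mann_whitney_u(a: list, b: list) -> tuple:
    """Return (U for a, two-sided p) using the normal approximation, no tie correction."""
    # U counts the pairs in which a's value exceeds b's; a tie counts one half.
    u = sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)
    n1, n2 = len(a), len(b)
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided tail probability
    return u, p


# Toy values in particles/L, invented for illustration only.
town = [50, 54, 58, 61, 65, 70]
upstream_raw = [48, 55, 57, 60, 66, 68]
upstream_corrected = [value / 2 for value in upstream_raw]  # undo the double-counting

diff_raw = median(town) - median(upstream_raw)
diff_corrected = median(town) - median(upstream_corrected)
u_raw, p_raw = mann_whitney_u(town, upstream_raw)
u_corrected, p_corrected = mann_whitney_u(town, upstream_corrected)

print(f"uncorrected: median difference {diff_raw}, U = {u_raw}, p = {p_raw:.3f}")
print(f"corrected:   median difference {diff_corrected}, U = {u_corrected}, p = {p_corrected:.4f}")
```

Running both branches side by side is itself a useful habit: it shows a reader how sensitive the result is to the one decision they are most likely to question.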

Contrast the same analysis done in a chat window. The model gives you the same answer, perhaps in the same minute. But the cleaning decisions are scattered through a conversation you will lose; the upstream halving is a thing the model did somewhere in the middle that nobody recorded; and next month there is no folder to hand anyone — only a number you once obtained and can no longer account for. Same model, same answer, utterly different research.

📥 Download: the Berg River sample archive

The deliberately messy starting files: the three inconsistent CSVs with their planted errors, the contradicting field notes, the stale README, and the metadata. With these you can run this whole workflow yourself. berg-river-microplastics.zip

✅ Download: the worked solution (model answer)

The completed, reproducible folder — a filled CLAUDE.md, the pre-registration, the data inventory, the cleaning-and-analysis script, the decision log, the processed data, and the outputs (summary table, run log, and a figure). Running python3 scripts/analyse.py regenerates every output from the raw data, so you can watch the whole chain work and confirm the result for yourself. Do the exercise first; reach for this when you want to check your structure against a model answer. berg-river-microplastics-worked-solution.zip
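One way to confirm that a re-run really regenerates the same outputs is to fingerprint the outputs folder before and after. The sketch below is an assumption about how you might do that, not part of the worked solution; the helper names and the `summary.csv` file are hypothetical, and rewriting the file stands in for re-running the analysis script.

```python
import hashlib
import tempfile
from pathlib import Path


def hash_outputs(folder: Path) -> dict:
    """Map each file under the folder (by relative path) to its SHA-256 digest."""
    return {
        path.relative_to(folder).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(folder.rglob("*"))
        if path.is_file()
    }


def changed_files(before: dict, after: dict) -> list:
    """List files that were added, removed, or whose content differs."""
    return sorted(
        name for name in before.keys() | after.keys()
        if before.get(name) != after.get(name)
    )


with tempfile.TemporaryDirectory() as tmp:
    outputs = Path(tmp)
    (outputs / "summary.csv").write_text("site,median\n", encoding="utf-8")
    before = hash_outputs(outputs)

    # Stand-in for re-running the analysis: the same content is written again.
    (outputs / "summary.csv").write_text("site,median\n", encoding="utf-8")
    same = changed_files(before, hash_outputs(outputs))

    # A genuinely different output shows up as a changed file.
    (outputs / "summary.csv").write_text("site,median\ntown,0\n", encoding="utf-8")
    different = changed_files(before, hash_outputs(outputs))

print(same, different)
```

Byte-for-byte comparison is strict: figures and logs that embed timestamps may differ between runs even when the numbers agree, so for those files it can make more sense to compare the values they contain.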

🔎 Verifying Agentic Work

When the agent runs for an hour and makes a hundred changes you did not watch, the instinct is to inspect each one — to read every diff as it happens. That does not scale, and pretending it does is the fastest route to a false sense of safety. The volume is the problem: when so much is changing all the time, there often isn't a chance to check each change, and any carefulness that depends on watching every one will quietly fail. So the move is to take verification off the per-change axis. Don't try to catch every mistake as it happens; instead make mistakes survivable, encode what must stay true so a machine checks it, and verify the few things that actually matter. Week 9's protocols and Week 10's Princeton reliability finding (Rabanser et al., 2026) still bite — the burden is real and it grows — but the response is to automate and target verification, not to watch harder.

1. Make mistakes recoverable

The quiet hero. With data/raw/ read-only, everything regenerable from raw plus scripts, and all of it under Git, any bad change can be rolled back and any output rebuilt. You are no longer reviewing to prevent disaster — disaster is structurally off the table — so you can safely not watch every step, and find errors later, in batch.
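A minimal sketch of the read-only part, assuming a POSIX-style permission model: it clears the write bits on every file under a folder. The function name `make_read_only` and the sample file are hypothetical, and the demonstration runs in a temporary directory rather than on a real data/raw/.

```python
import stat
import tempfile
from pathlib import Path


def make_read_only(folder: Path) -> int:
    """Remove write permission from every file under the folder; return the count."""
    count = 0
    for path in folder.rglob("*"):
        if path.is_file():
            mode = path.stat().st_mode
            path.chmod(mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))
            count += 1
    return count


with tempfile.TemporaryDirectory() as tmp:
    raw = Path(tmp) / "data" / "raw"
    raw.mkdir(parents=True)
    sample = raw / "site.csv"
    sample.write_text("sample_id,count\n", encoding="utf-8")

    locked = make_read_only(raw)
    still_writable = bool(sample.stat().st_mode & stat.S_IWUSR)

    # Restore the write bit so the temporary directory can be cleaned up everywhere.
    sample.chmod(sample.stat().st_mode | stat.S_IWUSR)

print(locked, still_writable)
```

File permissions guard against accidents rather than determined changes, since the owner can always restore them. They complement the CLAUDE.md rule and the Git history; they do not replace either.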

2. Automate the checks

Write down what must stay true (row counts didn't drop, values stay in physical range, a known sub-result still holds) and have the agent assert it after changes. A Claude Code hook can run your check script automatically after every edit, so a silently dropped column trips an assertion with no diff-reading at all. You read a one-line pass or fail, not the change.
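A check script of this kind can be very small. The sketch below is one possible shape, with hypothetical column names and toy rows. It returns a list of failures and reduces them to a single PASS or FAIL line. How the script is wired into a hook depends on your Claude Code configuration and is not shown here.

```python
REQUIRED_COLUMNS = {"sample_id", "site", "particles_per_l"}  # hypothetical names


def check_invariants(cleaned_rows: list, raw_row_count: int, documented_exclusions: int) -> list:
    """Return a list of failure messages; an empty list means every check passed."""
    failures = []
    # Row count: only the documented exclusions may disappear.
    expected = raw_row_count - documented_exclusions
    if len(cleaned_rows) != expected:
        failures.append(f"row count {len(cleaned_rows)} != expected {expected}")
    for index, row in enumerate(cleaned_rows):
        # Columns: a silently dropped column should fail loudly.
        missing = REQUIRED_COLUMNS - row.keys()
        if missing:
            failures.append(f"row {index} is missing columns {sorted(missing)}")
        # Physical range: a concentration cannot be negative.
        elif row["particles_per_l"] < 0:
            failures.append(f"row {index} has a negative concentration")
    return failures


def report(failures: list) -> str:
    """Collapse the failures into the one line you actually read."""
    return "PASS" if not failures else "FAIL: " + "; ".join(failures)


# Toy rows, invented for illustration only.
good_rows = [
    {"sample_id": "U-01", "site": "upstream", "particles_per_l": 20.0},
    {"sample_id": "T-01", "site": "town", "particles_per_l": 55.0},
]
broken_rows = [{"sample_id": "U-01", "site": "upstream"}]  # a column went missing

print(report(check_invariants(good_rows, raw_row_count=3, documented_exclusions=1)))
print(report(check_invariants(broken_rows, raw_row_count=3, documented_exclusions=1)))
```

A real script would also exit with a non-zero status on failure, so that whatever runs it can tell the difference without parsing the text.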

3. Verify the result, not every step

You cannot check every edit, but you can check the end against ground truth: re-derive the one number that matters, trace a few output values back to the raw data by hand, sanity-check against what you know of the domain. A hundred edits collapse into a handful of checks on the thing you actually care about.

4. Let the decision log choose your diffs

You don't read every diff; you read the consequential ones. The “log every consequential decision” rule means the agent surfaces the handful that mattered (halved the double-counted upstream counts; excluded the contaminated D-03). Those are the few worth your eyes; the boilerplate you can let go. The log is a high-signal substitute for the full diff.

These layers catch different failures, and it helps to see which catches which. A wrong-direction failure — the agent confidently pursuing the wrong goal because it inferred the wrong intent, and taking you on a tangent for days — is invisible to any amount of diff-reading, because no single change is wrong; the question is wrong. That is what the pre-registration gates from B.2 are for: a closed gate stops the tangent early. A quiet-correctness failure — a dropped column, a plausible default you didn't notice — is what the automated assertions catch. And reviewing at the checkpoint (the commit boundary) rather than at every keystroke is how human review stays in the loop without becoming a full-time job. Verification scales when it is layered, not when it is heroic.

⚠️ What stays human, honestly

The structure verifies the process; it does not verify the judgement. Is this the right analysis for the question? Does the result make physical sense for a river? Is the effect real or an artefact of the small sample after exclusions? Those are research judgements, and no amount of logging makes them the agent's job. The Week 7 silent-error problem and the Week 9 plausible-but-wrong problem do not disappear because the work is well-documented — a beautifully reproducible analysis can still be reproducibly wrong. And there is a cost to all of this: tokens, time, and the genuine review effort every agentic result demands. Budget for it.

👤 From the instructor's own practice

“In my earlier workflows I could go off on tangents for days, sometimes weeks, thinking that something was interesting, when in fact we were just chasing a bug in another paper. In this case the agent thought that the bug-chasing was the interesting thing we were trying to do, and I didn’t have the right checks in place to understand that that’s what it thought. No amount of looking at individual changes would have caught that: nothing was wrong with any one line; the whole direction was wrong.”

“I’ve had lots of smaller errors too, and the genuine issue is that when so much is changing all the time, you often can’t check each change as it happens (as you wouldn’t if you were supervising a student). So I’ve stopped trying to be careful by watching everything. The carefulness has to live somewhere else. It’s in the gates I set at the start so a tangent gets stopped early, in keeping the structure so that any mistake can be undone, and in checking the things that actually matter rather than every step along the way. I’m still trying to figure out how to do this well at the volume the tools now make possible.”

📄 Disclosure: The Reproducible Folder Is the Disclosure

Week 4 and Week 11.3 asked you to disclose your AI use. Week 11.3 also delivered the bracing finding that, although roughly 70% of journals have an AI policy, only about 0.1% of papers actually disclose AI use (He & Bu, 2026, PNAS). Most disclosure, where it happens at all, is a vague sentence: “AI tools were used in the preparation of this work.” That sentence tells a reader almost nothing.

A reproducible project folder flips the problem. You do not need a sentence that gestures at what the AI did, because the folder shows it: the CLAUDE.md states the rules the agent worked under; the decision log records every judgement and who made it; the scripts are the method; the Git history is the sequence of events; the pre-registration proves the standard predated the result. The honest answer to “what did the AI do, and what did you decide?” is not a paragraph — it is “here is the folder.” That is what good disclosure actually looks like, and it is a higher standard than almost any journal currently asks for.
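If the folder is the disclosure, it is worth checking that the folder is complete before you hand it over. The sketch below lists the artefacts named in this lesson and reports which are missing. The checker itself is an assumption, the paths should be adjusted to your own layout, and the demonstration uses a temporary directory rather than a real project.

```python
import tempfile
from pathlib import Path

# Paths taken from this lesson's folder layout, each with what it discloses.
DISCLOSURE_ARTEFACTS = {
    "CLAUDE.md": "the rules the agent worked under",
    "notes/decision-log.md": "every judgement call and its source",
    "pre-registrations": "the standard set before the result",
    "scripts": "the method",
    "outputs/analysis_run.log": "what was actually executed",
    ".git": "the sequence of events",
}


def missing_artefacts(project: Path) -> list:
    """Return the expected paths that do not exist under the project folder."""
    return [rel for rel in DISCLOSURE_ARTEFACTS if not (project / rel).exists()]


with tempfile.TemporaryDirectory() as tmp:
    project = Path(tmp)
    (project / "CLAUDE.md").write_text("rules\n", encoding="utf-8")
    (project / "scripts").mkdir()
    gaps = missing_artefacts(project)

for rel in gaps:
    print(f"missing: {rel} ({DISCLOSURE_ARTEFACTS[rel]})")
```

Existence is the weakest possible test: an empty decision log passes it. The check only tells you the folder has the right shape, and reading the contents remains your job.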

🎓 The capstone connection

If you took this track alongside the course, the Week 12 capstone is where it can land: a reproducible project folder is the natural form for the concrete piece of work the capstone asks you to commit to. You do not just argue that you would use AI responsibly in your research — you show the structure in which you would.

✅ What to take from the Advanced Track

Claude Code is a model with hands in your project folder, and the work lives in files, not in the chat. Reproducibility is a different goal from verification, and the agent can be instructed to serve it — logging decisions, keeping raw data sacred, holding you to a pre-registration — better than unaided human discipline usually manages.

The CLAUDE.md research-habits template is where the course's integrity becomes a rule the agent follows. Pre-registration is where you refuse to fool yourself. The reproducible folder is where verification, disclosure, and trust all meet — the same folder answers all three.

And the disposition underneath has not changed since Lesson A: delegate, then verify; the tool amplifies your practice rather than replacing your judgement; and you scaffold in proportion to what the work has to bear. Use the heavy machinery where the stakes are real, lightly where they are not, and honestly everywhere.